In this work we examine whether education finances and household income are correlated with the unemployment rate. The results did not indicate any strong relation, which is consistent with prior findings that the unemployment rate behaves without any clearly visible pattern. For that reason, this work focuses on exploring the relations between education finances, household income, and the unemployment rate.
Starting from three different data sets, a new data frame is created for this study, based on a selection of variables from each. From the Mean and Median data set, the median income in current dollars is kept. From the unemployment data set, the unemployment rate by state is included. From the U.S. Educational Finances data set, the variables kept for this study are the following:
Additionally, the values of State and Year are normalized so the three data sets can be joined on those keys.
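The assembly step above could be sketched as follows. This is only an illustration: the frame names (`finances`, `income`, `unemployment_raw`) are hypothetical, and the real source files may use different column names that would need renaming first.

```r
library(dplyr)

# Normalise the join keys so States and Years match across sources.
normalise_keys <- function(df) {
  df %>%
    mutate(STATE = toupper(trimws(STATE)),  # consistent state spelling
           YEAR  = as.integer(YEAR))
}

# Keep only rows present in all three sources.
us_data <- normalise_keys(finances) %>%
  inner_join(normalise_keys(income),           by = c("STATE", "YEAR")) %>%
  inner_join(normalise_keys(unemployment_raw), by = c("STATE", "YEAR"))
```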
After cleaning the data set, we proceed to explore it visually.
Analyzing the plot, there appears to be some correlation between the two variables. Median income has been increasing for the last two decades, while unemployment suffers from spikes that can be attributed to periods of economic disarray; for example, the impact of the 2008 economic crisis is visible in the 2010 spike, during which the median household income also dips slightly. Overall, when income improves, the unemployment rate tends to decrease.
Similar to the previous segment, the unemployment rate increases the most when the capital spent on education decreases. This does not imply causation: employment suffers the most when the whole economy is suffering, and the same goes for the amount of money spent on education. The decrease visible around 2010 may simply be due to governments having less capital to spend on education.
This plot ranks the states from the highest to the lowest unemployment rate, computed as the average of this variable over the years covered.
Similar to the previous one, this plot ranks the states, in this case by their average median household income over the last two decades.
Each dot represents a state, with its size proportional to the state's total capital revenue; the biggest is California. The animation follows the progress of the unemployment rate and the median household income, and it largely mirrors the first graph of this analysis: most states move toward lower unemployment and higher median household income until 2010, when the unemployment rate spikes and income stops improving.
5. Animating a plot through time to represent the total capital spent on education and the unemployment rate
As before, each dot represents a state, sized by the state's total capital revenue, the biggest being California. This animation follows the progress of the unemployment rate against the total capital spent on education, and it shows much the same pattern as the previous animation, up to the 2010 spike in the unemployment rate.
Strength of relationships
Before analysing the strength of the relationships between the variables, a new variable is added to the data set: Profit. Profit is the difference between the Total Revenue and the Total Expenditure, i.e. the value that tells us how many economic resources a state has left over for education.
us_data$Profit <- us_data$TOTAL_REVENUE - us_data$TOTAL_EXPENDITURE
Since the question of interest is "How do education finances and household income affect the unemployment rate?", the response variable is the Unemployment Rate and the independent variables are Household Income and the new variable Profit. We suspect these variables might affect the unemployment rate, so we check their relationship with the response variable.
ggplot(us_data, aes(x = Median.Income, y = Unemployment.Rate)) +
  geom_point()
ggplot(us_data, aes(x = Unemployment.Rate, y = Profit)) +
  geom_point() +
  geom_smooth(method = "lm")
Looking at the scatter plot of Unemployment Rate against Median Income, no relation is visible, since the data do not follow any pattern. Conversely, the scatter plot of Unemployment Rate against the new variable Profit shows a pattern that seems roughly linear. The correlation matrix is consistent with this ordering (both correlations are weak, but Median Income's is the weaker), so Median Income is dropped.
numeric_subset <- select(us_data, Profit, Median.Income, Unemployment.Rate)
M <- cor(numeric_subset)
M
## Profit Median.Income Unemployment.Rate
## Profit 1.00000000 0.05677694 -0.05513304
## Median.Income 0.05677694 1.00000000 0.04451593
## Unemployment.Rate -0.05513304 0.04451593 1.00000000
The question of interest is "How do education finances and household income affect the unemployment rate?", but since the fitted line on the scatter plot appears sensitive to the choice of treating the Unemployment Rate as the response variable and Household Income as the independent variable, the roles are swapped: Y becomes Profit and X becomes the Unemployment Rate.
The linear regression is performed:
Profit = β0 + β1 × Unemployment Rate
mod<-lm(us_data$Profit ~ us_data$Unemployment.Rate)
summary(mod)
##
## Call:
## lm(formula = us_data$Profit ~ us_data$Unemployment.Rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5348033 -78280 72701 166371 4000620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8636 52720 -0.164 0.8699
## us_data$Unemployment.Rate -17247 8935 -1.930 0.0538 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 583500 on 1222 degrees of freedom
## Multiple R-squared: 0.00304, Adjusted R-squared: 0.002224
## F-statistic: 3.726 on 1 and 1222 DF, p-value: 0.05381
The intercept is not statistically significant. The coefficient β1 is significant only at the 10% level (p = 0.0538): at the usual 5% level we fail to reject the null hypothesis, so the evidence that the unemployment rate affects Profit is weak. The F test tells the same story (p = 0.054).
The R-squared is very low (about 0.003), meaning the model explains almost none of the variance. As a further check, the next step is to validate the assumptions on which the regression model is based: normality of the residuals, homoskedasticity of the residuals, and independence of the residuals.
jarque.bera.test(mod$residuals)
##
## Jarque Bera Test
##
## data: mod$residuals
## X-squared = 23343, df = 2, p-value < 2.2e-16
bptest(mod)
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 18.848, df = 1, p-value = 1.416e-05
dwtest(mod)
##
## Durbin-Watson test
##
## data: mod
## DW = 1.9389, p-value = 0.1394
## alternative hypothesis: true autocorrelation is greater than 0
The only assumption that holds is the independence of the residuals, since the Durbin-Watson test is the only one with a p-value above the alpha level (0.139): the Jarque-Bera test rejects normality and the Breusch-Pagan test rejects homoskedasticity. This model is not a good model.
The next step is scaling the variables, since they have very different ranges, and performing the regression again.
us_data$new_profit <- scale(us_data$Profit)
us_data$new_rate <- scale(us_data$Unemployment.Rate)
mod2<-lm(us_data$new_profit ~ us_data$new_rate)
summary(mod2)
##
## Call:
## lm(formula = us_data$new_profit ~ us_data$new_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1552 -0.1340 0.1245 0.2848 6.8486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.060e-17 2.855e-02 0.00 1.0000
## us_data$new_rate -5.513e-02 2.856e-02 -1.93 0.0538 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9989 on 1222 degrees of freedom
## Multiple R-squared: 0.00304, Adjusted R-squared: 0.002224
## F-statistic: 3.726 on 1 and 1222 DF, p-value: 0.05381
#residuals test
jarque.bera.test(mod2$residuals)
##
## Jarque Bera Test
##
## data: mod2$residuals
## X-squared = 23343, df = 2, p-value < 2.2e-16
bptest(mod2)
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 18.848, df = 1, p-value = 1.416e-05
dwtest(mod2)
##
## Durbin-Watson test
##
## data: mod2
## DW = 1.9389, p-value = 0.1394
## alternative hypothesis: true autocorrelation is greater than 0
In this work the variable Unemployment.Rate has been chosen as the variable of interest, in order to predict whether a state is going to have unemployment or not depending on its income. To do this, the variable was first transformed into two classes, True and False: True represents an unemployment rate above 4.3%, while False represents a rate below 4.3%. This threshold was chosen after a preliminary analysis showing that an unemployment rate of around 4.3% is typical, as can be seen in the figure below.
# plot
us_data %>%
ggplot( aes(x= YEAR, y= Unemployment.Rate)) +
geom_line(color="#69b3a2") +
ylim(0,15) +
geom_hline(yintercept=4.3, color="orange", size=.5) +
theme_ipsum()
Three supervised classification algorithms have been used to predict the unemployment class: first the K-Nearest Neighbors algorithm, then a Logistic Regression model, and finally a Classification Tree.
To run these algorithms, the data are first split into training and test sets; in our case we used 80% for training and 20% for testing. The createDataPartition function builds the training index.
trainIndex <- createDataPartition(us_data$unemployment,
p = .8,
list = FALSE,
times = 1
)
Cross-validation is a statistical method used to estimate the skill of machine learning models. To carry it out, trainControl(method = "cv") is defined together with the number of folds; afterwards, the same training and testing process as before is repeated.
fitControl <- trainControl(
method = "cv",
number = 10
)
Before the training and testing process, a cleaning step was carried out in which the missing values were removed from the database, so that the prediction is much cleaner.
us_data <- na.omit(us_data)
Classification algorithms require a categorical target, so at the start the variable of interest was transformed into categorical values, leaving the other variables as int or num.
unemployment <- ifelse( us_data$Unemployment.Rate >= 4.3, "True", "False")
us_finances <- mutate(us_data, unemployment)
Many statistical measures can be used to compare models; in this work we use the confusion matrix, which shows the true negatives, true positives, false negatives, and false positives of each classifier. From it, the accuracy of the model can be obtained and compared with that of the other models. The Kappa statistic is also a good measure of the reliability of a model.
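As an illustration, accuracy and Cohen's Kappa can be computed by hand from a 2x2 confusion matrix; the counts below are taken from the cross-validated KNN confusion matrix reported later in this section.

```r
# Rows: predicted class; columns: true class.
cm <- matrix(c(28,  29,
               11, 176),
             nrow = 2, byrow = TRUE,
             dimnames = list(pred  = c("FALSE", "TRUE"),
                             truth = c("FALSE", "TRUE")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa    <- (accuracy - expected) / (1 - expected)
round(c(accuracy = accuracy, kappa = kappa), 4)   # 0.8361, 0.4857
```

These hand-computed values match the Accuracy and Kappa that confusionMatrix() reports for the same counts.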
As indicated at the beginning of this section, three algorithms have been applied to predict the variable of interest: K-Nearest Neighbors, Logistic Regression, and Classification Trees. Below you can see how each of the algorithms has been implemented.
KNN
fit_cv_grid <- caret::train(
unemployment ~ .,
data = training_set,
method = "knn",
trControl = fitControl,
tuneGrid = grid
)
Logistic Regression
getParamSet("classif.logreg")
## Type len Def Constr Req Tunable Trafo
## model logical - TRUE - - FALSE -
learner_log <- makeLearner("classif.logreg",
predict.type = "response")
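The logistic learner is only constructed above, not fitted. A minimal sketch of how it could be trained and evaluated with mlr, assuming the `task` object built later in this report, would be:

```r
# Fit the logistic learner on the classification task and compare its
# confusion matrix and accuracy with the other models.
mod_log  <- mlr::train(learner_log, task)
pred_log <- predict(mod_log, task = task)
mlr::calculateConfusionMatrix(pred_log)
mlr::performance(pred_log, measures = list(mlr::acc, mlr::mmce))
```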
Classification Tree
getParamSet("classif.rpart")
## Type len Def Constr Req Tunable Trafo
## minsplit integer - 20 1 to Inf - TRUE -
## minbucket integer - - 1 to Inf - TRUE -
## cp numeric - 0.01 0 to 1 - TRUE -
## maxcompete integer - 4 0 to Inf - TRUE -
## maxsurrogate integer - 5 0 to Inf - TRUE -
## usesurrogate discrete - 2 0,1,2 - TRUE -
## surrogatestyle discrete - 0 0,1 - TRUE -
## maxdepth integer - 30 1 to 30 - TRUE -
## xval integer - 10 0 to Inf - FALSE -
## parms untyped - - - - TRUE -
learner_bctree <- makeLearner("classif.rpart",
predict.type = "response")
Even with the scaling, the regression has the same problems as the first one. Since the scatter plot still suggests a linear relationship, the next attempt to improve the model was to detect outliers, but we did not succeed in doing so.
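One standard approach that could be tried for the outlier step is the 1.5 × IQR rule; this is only a sketch of that idea, not the attempt described above.

```r
# Flag Profit values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q   <- quantile(us_data$Profit, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
outlier <- us_data$Profit < q[1] - 1.5 * iqr |
           us_data$Profit > q[2] + 1.5 * iqr
us_trimmed <- us_data[!outlier, ]   # the regression could then be refit on this
```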
As indicated at the beginning of the project, the objective is to predict the unemployment class from the average income of each state. Three algorithms have been implemented for this purpose. The figure below shows the relationship between average income and the variable of interest.
ggplot(data = us_data) +
geom_histogram(
mapping = aes(x = Median.Income, fill = unemployment),
alpha = .7,
position = "identity"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Before starting the analysis, it is important to clean and prepare the data.
# Define the variable unemployment as True or False. We are going to focus on the 1st Qu.
unemployment <- ifelse(us_data$Unemployment.Rate >= 4.3, "TRUE", "FALSE")
us_data <- mutate(us_data, unemployment)
# MACHINE LEARNING
task <- makeClassifTask(id = "US Finances", data = us_data,
target = "unemployment", positive = "TRUE")
task # inspect the task
## Supervised task: US Finances
## Type: classif
## Target: unemployment
## Observations: 1224
## Features:
## numerics factors ordered functionals
## 14 1 0 2
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## FALSE TRUE
## 287 937
## Positive class: TRUE
KNN
# Training
trainIndex <- createDataPartition(us_data$unemployment,
p = .8,
list = FALSE,
times = 1
)
training_set <- us_data[ trainIndex, ]
test_set <- us_data[ -trainIndex, ]
basic_fit <- caret::train(unemployment ~ ., data = training_set, method = "knn")
basic_preds <- predict(basic_fit, test_set)
fitControl <- trainControl(
method = "cv",
number = 10
)
fit_with_cv <- caret::train(
unemployment ~ .,
data = training_set,
method = "knn",
trControl = fitControl
)
fit_cv_preds <- predict(fit_with_cv, test_set)
unemployment_factor <- as.factor(test_set$unemployment)
confusionMatrix(unemployment_factor, fit_cv_preds, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 28 29
## TRUE 11 176
##
## Accuracy : 0.8361
## 95% CI : (0.7835, 0.8802)
## No Information Rate : 0.8402
## P-Value [Acc > NIR] : 0.61041
##
## Kappa : 0.4857
##
## Mcnemar's Test P-Value : 0.00719
##
## Sensitivity : 0.8585
## Specificity : 0.7179
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.4912
## Prevalence : 0.8402
## Detection Rate : 0.7213
## Detection Prevalence : 0.7664
## Balanced Accuracy : 0.7882
##
## 'Positive' Class : TRUE
##
grid <- expand.grid(k = 1:20)
fit_cv_grid <- caret::train(
unemployment ~ .,
data = training_set,
method = "knn",
trControl = fitControl,
tuneGrid = grid
)
preds_cv_grid <- predict(fit_cv_grid, test_set)
confusionMatrix(unemployment_factor, preds_cv_grid, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 36 21
## TRUE 20 167
##
## Accuracy : 0.832
## 95% CI : (0.779, 0.8766)
## No Information Rate : 0.7705
## P-Value [Acc > NIR] : 0.01162
##
## Kappa : 0.5278
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8883
## Specificity : 0.6429
## Pos Pred Value : 0.8930
## Neg Pred Value : 0.6316
## Prevalence : 0.7705
## Detection Rate : 0.6844
## Detection Prevalence : 0.7664
## Balanced Accuracy : 0.7656
##
## 'Positive' Class : TRUE
##
Classification Tree
ParamHelpers::getParamSet("classif.rpart")
## Type len Def Constr Req Tunable Trafo
## minsplit integer - 20 1 to Inf - TRUE -
## minbucket integer - - 1 to Inf - TRUE -
## cp numeric - 0.01 0 to 1 - TRUE -
## maxcompete integer - 4 0 to Inf - TRUE -
## maxsurrogate integer - 5 0 to Inf - TRUE -
## usesurrogate discrete - 2 0,1,2 - TRUE -
## surrogatestyle discrete - 0 0,1 - TRUE -
## maxdepth integer - 30 1 to 30 - TRUE -
## xval integer - 10 0 to Inf - FALSE -
## parms untyped - - - - TRUE -
learner_bctree <- mlr::makeLearner("classif.rpart",
predict.type = "response")
learner_bctree$par.set #same as getparamset
## Type len Def Constr Req Tunable Trafo
## minsplit integer - 20 1 to Inf - TRUE -
## minbucket integer - - 1 to Inf - TRUE -
## cp numeric - 0.01 0 to 1 - TRUE -
## maxcompete integer - 4 0 to Inf - TRUE -
## maxsurrogate integer - 5 0 to Inf - TRUE -
## usesurrogate discrete - 2 0,1,2 - TRUE -
## surrogatestyle discrete - 0 0,1 - TRUE -
## maxdepth integer - 30 1 to 30 - TRUE -
## xval integer - 10 0 to Inf - FALSE -
## parms untyped - - - - TRUE -
mod_bctree <- mlr::train(learner_bctree, task)
## Functional features have been converted to numerics
getLearnerModel(mod_bctree)
## n= 1224
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1224 287 TRUE (0.2344771 0.7655229)
## 2) Unemployment.Rate< 4.25 287 0 FALSE (1.0000000 0.0000000) *
## 3) Unemployment.Rate>=4.25 937 0 TRUE (0.0000000 1.0000000) *
predict_bctree <- predict(mod_bctree, task = task)
## Functional features have been converted to numerics
head(as.data.frame(predict_bctree))
conf_matrix_bctree <- calculateConfusionMatrix(predict_bctree)
conf_matrix_bctree
## predicted
## true FALSE TRUE -err.-
## FALSE 287 0 0
## TRUE 0 937 0
## -err.- 0 0 0
The classification tree reports an accuracy of 100%, but this is not because the problem is easy: the predictor set still contains Unemployment.Rate, the very variable from which the target was derived, and the tree simply splits on it at 4.25, as the tree output above shows; the KNN models, which generalize from the other variables, reach about 83%. The confusion matrix above can still be read in the usual way. The upper-left cell holds the true negatives: 287 of the 1224 observations (23.45%) are False and the model classifies them as such. The lower-right cell holds the 937 true positives, observations that are True and classified as True. The upper-right cell would hold false positives (cases classified True whose real class is False) and the lower-left cell false negatives, and both are 0 here.
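One way to check this leakage explanation (an assumption on our part) is to refit the tree after dropping `Unemployment.Rate`, the column the target was derived from:

```r
# Remove the target-derived predictor and rebuild the task.
us_no_leak   <- dplyr::select(us_data, -Unemployment.Rate)
task_no_leak <- mlr::makeClassifTask(id = "US Finances (no leak)",
                                     data = us_no_leak,
                                     target = "unemployment",
                                     positive = "TRUE")
mod_no_leak  <- mlr::train(mlr::makeLearner("classif.rpart"), task_no_leak)
pred_no_leak <- predict(mod_no_leak, task = task_no_leak)
mlr::calculateConfusionMatrix(pred_no_leak)  # accuracy should drop well below 100%
```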
These results indicate that the unemployment rate does not depend strongly on education finances in particular states or on the average household income. Future work should take more variables into consideration in order to explore the problem from a wider perspective.
Psacharopoulos, George. (2006). The Value of Investment in Education: Theory, Evidence and Policy. http://lst-iiep.iiep-unesco.org/cgi-bin/wwwi32.exe/[in=epidoc1.in]/?t2000=024211/(100). 32.
Evans, William & Murray, Sheila & Schwab, Robert. (1998). Education-Finance Reform and the Distribution of Education Resources. American Economic Review. 88. 789-812.
Alan L. Montgomery, Victor Zarnowitz, Ruey S. Tsay & George C. Tiao (1998) Forecasting the U.S. Unemployment Rate, Journal of the American Statistical Association, 93:442, 478-493, DOI: 10.1080/01621459.1998.10473696
https://www.pgpf.org/blog/2019/10/income-and-wealth-in-the-united-states-an-overview-of-data